{ "cells": [ { "cell_type": "markdown", "id": "ba8527d4-7f70-4ed2-986e-9f2a48967690", "metadata": {}, "source": [ "# Custom Environment" ] }, { "cell_type": "markdown", "id": "2bc6c081-51dd-4df3-8a8c-f54ffb1feeaa", "metadata": {}, "source": [ "You can find a repo with a corresponding example/template at [https://github.com/rl-tools/example](https://github.com/rl-tools/example)" ] }, { "cell_type": "markdown", "id": "ae6ddca1-73c8-4280-8760-a89ddc9bf24c", "metadata": {}, "source": [ "As always, we first include the elementary operations:" ] }, { "cell_type": "code", "execution_count": 1, "id": "cc4f594c-fd4f-4e92-95ed-066d96010587", "metadata": {}, "outputs": [], "source": [ "#define RL_TOOLS_BACKEND_ENABLE_OPENBLAS\n", "#include \n", "#include \n", "#include \n", "namespace rlt = rl_tools;\n", "#pragma cling load(\"openblas\")" ] }, { "cell_type": "markdown", "id": "ec2f3181-7f15-4040-bf8b-63f786b3bcd5", "metadata": {}, "source": [ "Next we define the datastructures for our new environment. The main data structures are for the state of the environment (`MyPendulumState`) and for the environment itself (`MyPendulum`). As usual in RLtools, we assemble all template parameters of the environment into a specification (`MyPendulumSpecification`) so that we do not need to repeat them in every function template. Furthermore, we separate out the parameters. With RLtools environments we distinguish between three levels of \"state\":\n", "- **Environment**: Compile-time, should not change at runtime (though this is not enforced to allow for hackability)\n", "- **Parameters**: Constant throughout an episode. This allows for e.g. domain randomization. 
It can also carry cues for the visualization (the constness during an episode is also not enforced but is considered a best practice)\n", "- **State**: Sampled from the initial distribution at the beginning of an episode and can then change on every step\n", "\n", "To work with the RLtools API, the environment data structure (`MyPendulum`) needs to have the fields:\n", "- `T`: Floating point type\n", "- `TI`: Index/unsigned integer type\n", "- `Parameters`: Parameters datastructure (can also contain purely compile-time `constexpr` members), should be a [Plain Old Data (POD)](https://en.wikipedia.org/wiki/Passive_data_structure) structure so that it works well on GPUs and microcontrollers\n", "- `State`: State datastructure, should be a [Plain Old Data (POD)](https://en.wikipedia.org/wiki/Passive_data_structure) structure for the same reasons\n", "- `OBSERVATION_DIM`: Dimension of the observations\n", "- `ACTION_DIM`: Dimension of the actions" ] }, { "cell_type": "code", "execution_count": 2, "id": "7949edef-7557-4e14-9e76-1b5958b5ba77", "metadata": {}, "outputs": [], "source": [ "template \n", "struct MyPendulumParameters {\n", " constexpr static T G = 10;\n", " constexpr static T MAX_SPEED = 8;\n", " constexpr static T MAX_TORQUE = 2;\n", " constexpr static T DT = 0.05;\n", " constexpr static T M = 1;\n", " constexpr static T L = 1;\n", " constexpr static T INITIAL_STATE_MIN_ANGLE = -rlt::math::PI;\n", " constexpr static T INITIAL_STATE_MAX_ANGLE = rlt::math::PI;\n", " constexpr static T INITIAL_STATE_MIN_SPEED = -1;\n", " constexpr static T INITIAL_STATE_MAX_SPEED = 1;\n", "};\n", "\n", "template >\n", "struct MyPendulumSpecification{\n", " using T = T_T;\n", " using TI = T_TI;\n", " using PARAMETERS = T_PARAMETERS;\n", "};\n", "\n", "template \n", "struct MyPendulumState{\n", " static constexpr TI DIM = 2;\n", " T theta;\n", " T theta_dot;\n", "};\n", "template \n", "struct MyPendulumFourierObservation{\n", " static constexpr TI DIM = 3; // cos(theta), sin(theta), theta_dot\n", 
"};\n", "\n", "template \n", "struct MyPendulum{\n", " using SPEC = T_SPEC;\n", " using T = typename SPEC::T;\n", " using TI = typename SPEC::TI;\n", " using Parameters = typename SPEC::PARAMETERS;\n", " using State = MyPendulumState;\n", " using Observation = MyPendulumFourierObservation;\n", " using ObservationPrivileged = Observation;\n", " static constexpr TI ACTION_DIM = 1;\n", " static constexpr TI N_AGENTS = 1;\n", "};" ] }, { "cell_type": "markdown", "id": "2d3a16ec-e307-4918-916c-3fa21095b72f", "metadata": {}, "source": [ "Next we can start defining operations on these datastructures. Note that they should be in the `rl_tools` namespace so that the RLtools algorithms (such as the on-/off-policy runner) can find and dispatch to them. If you want to use functions outside the `rl_tools` namespace you can just implement proxy functions that call your arbitrary functions. In our case we do not need dynamic memory allocation or initialization, hence just implement them as a [NOP](https://en.wikipedia.org/wiki/NOP_(code)). The `sample_initial_state` function samples random initial states and the `initial_state` provides a deterministic initial state (for deterministic evaluations). In the case of the pendulum a reasonable choice for the latter could be e.g. the state where it is hanging downwards with zero velocity. 
" ] }, { "cell_type": "code", "execution_count": 3, "id": "b3bdeeea-3caa-4b79-85c5-76683abc9af6", "metadata": {}, "outputs": [], "source": [ "namespace rl_tools{\n", " template\n", " static void malloc(DEVICE& device, const MyPendulum& env){}\n", " template\n", " static void free(DEVICE& device, const MyPendulum& env){}\n", " template\n", " static void init(DEVICE& device, const MyPendulum& env){}\n", " template\n", " static void sample_initial_parameters(DEVICE& device, const MyPendulum& env, typename MyPendulum::Parameters& parameters, RNG& rng){ }\n", " template\n", " static void initial_parameters(DEVICE& device, const MyPendulum& env, typename MyPendulum::Parameters& parameters){ }\n", " template\n", " static void sample_initial_state(DEVICE& device, const MyPendulum& env, const typename MyPendulum::Parameters& parameters, typename MyPendulum::State& state, RNG& rng){\n", " state.theta = random::uniform_real_distribution(typename DEVICE::SPEC::RANDOM(), SPEC::PARAMETERS::INITIAL_STATE_MIN_ANGLE, SPEC::PARAMETERS::INITIAL_STATE_MAX_ANGLE, rng);\n", " state.theta_dot = random::uniform_real_distribution(typename DEVICE::SPEC::RANDOM(), SPEC::PARAMETERS::INITIAL_STATE_MIN_SPEED, SPEC::PARAMETERS::INITIAL_STATE_MAX_SPEED, rng);\n", " }\n", " template\n", " static void initial_state(DEVICE& device, const MyPendulum& env, const typename MyPendulum::Parameters& parameters, typename MyPendulum::State& state){\n", " state.theta = -rlt::math::PI;\n", " state.theta_dot = 0;\n", " }\n", "}" ] }, { "cell_type": "markdown", "id": "71103725-03e5-44dd-98f5-4779acaa44ec", "metadata": {}, "source": [ "In the following we define some helper functions. Note: the usage of `rlt::math::xxx` for math functions seems tedious over e.g. `std::xxx` but it allows to dispatch to the right math implementations on GPUs and microcontrollers and hence running the same code on any device. 
" ] }, { "cell_type": "code", "execution_count": 4, "id": "fda09574-1230-4a9e-b76e-5a21146beb00", "metadata": {}, "outputs": [], "source": [ "template \n", "T clip(T x, T min, T max){\n", " x = x < min ? min : (x > max ? max : x);\n", " return x;\n", "}\n", "template \n", "T f_mod_python(const DEVICE& dev, T a, T b){\n", " return a - b * rlt::math::floor(dev, a / b);\n", "}\n", "\n", "template \n", "T angle_normalize(const DEVICE& dev, T x){\n", " return f_mod_python(dev, (x + rlt::math::PI), (2 * rlt::math::PI)) - rlt::math::PI;\n", "}" ] }, { "cell_type": "markdown", "id": "255fce1d-7a17-44e0-9427-0ba3f7ce9db5", "metadata": {}, "source": [ "Next we implement the most important operations (which resemble the OpenAI gym interface): \n", "- `step`: Takes a `state`, executes an `action` and sets the `next_state`\n", "- `reward`: Returns the reward based on the `state`, `action`, and `next_state`\n", "- `observe`: Observes the `state`. For fully observed environments this should basically just flatten the `::State` data structure and possibly apply some observation noise. 
For partially observable environments the observation can also just contain parts of the information in the `::State`\n", "- `terminated`: Returns a boolean flag signalling if the `state` is a terminal state" ] }, { "cell_type": "code", "execution_count": 5, "id": "689e66c0-bfad-4042-b4a3-7eddd3addf07", "metadata": {}, "outputs": [], "source": [ "namespace rl_tools{\n", " template\n", " typename SPEC::T step(DEVICE& device, const MyPendulum& env, const typename MyPendulum::Parameters& parameters, const typename MyPendulum::State& state, const Matrix& action, typename MyPendulum::State& next_state, RNG& rng) {\n", " static_assert(ACTION_SPEC::ROWS == 1);\n", " static_assert(ACTION_SPEC::COLS == 1);\n", " typedef typename SPEC::T T;\n", " typedef typename SPEC::PARAMETERS PARAMS;\n", " T u_normalised = get(action, 0, 0);\n", " T u = PARAMS::MAX_TORQUE * u_normalised;\n", " T g = PARAMS::G;\n", " T m = PARAMS::M;\n", " T l = PARAMS::L;\n", " T dt = PARAMS::DT;\n", "\n", " u = clip(u, -PARAMS::MAX_TORQUE, PARAMS::MAX_TORQUE);\n", "\n", " T newthdot = state.theta_dot + (3 * g / (2 * l) * rlt::math::sin(device.math, state.theta) + 3.0 / (m * l * l) * u) * dt;\n", " newthdot = clip(newthdot, -PARAMS::MAX_SPEED, PARAMS::MAX_SPEED);\n", " T newth = state.theta + newthdot * dt;\n", "\n", " next_state.theta = newth;\n", " next_state.theta_dot = newthdot;\n", " return SPEC::PARAMETERS::DT;\n", " }\n", " template\n", " static typename SPEC::T reward(DEVICE& device, const MyPendulum& env, const typename MyPendulum::Parameters& parameters, const typename MyPendulum::State& state, const Matrix& action, const typename MyPendulum::State& next_state, RNG& rng){\n", " typedef typename SPEC::T T;\n", " T angle_norm = angle_normalize(device.math, state.theta);\n", " T u_normalised = get(action, 0, 0);\n", " T u = SPEC::PARAMETERS::MAX_TORQUE * u_normalised;\n", " T costs = angle_norm * angle_norm + 0.1 * state.theta_dot * state.theta_dot + 0.001 * (u * u);\n", " return -costs;\n", " 
}\n", "\n", " template\n", " static void observe(DEVICE& device, const MyPendulum& env, const typename MyPendulum::Parameters& parameters, const typename MyPendulum::State& state, const MyPendulumFourierObservation&, Matrix& observation, RNG& rng){\n", " static_assert(OBS_SPEC::ROWS == 1);\n", " static_assert(OBS_SPEC::COLS == 3);\n", " typedef typename SPEC::T T;\n", " set(observation, 0, 0, rlt::math::cos(device.math, state.theta));\n", " set(observation, 0, 1, rlt::math::sin(device.math, state.theta));\n", " set(observation, 0, 2, state.theta_dot);\n", " }\n", " template\n", " RL_TOOLS_FUNCTION_PLACEMENT static bool terminated(DEVICE& device, const MyPendulum& env, const typename MyPendulum::Parameters& parameters, const typename MyPendulum::State state, RNG& rng){\n", " using T = typename SPEC::T;\n", " return false;\n", " }\n", "}" ] }, { "cell_type": "markdown", "id": "d47b8b88-978a-490c-a1f2-b2b0f07e4fae", "metadata": {}, "source": [ "Since the training functions for the RL algorithms need to execute these operations, they need to be defined before the RL data-collection and training operations. Hence, in the following we include the RL ([Loop Interface](./07-The%20Loop%20Interface.ipynb)) operations. Note: when setting up your project you might want to assemble the previous data-structure definitions and operations into a header file so that all the `#include` directives are at the beginning of your code (still remember to include the header files for your environment before the RL operations). A recommended structure (that RLtools follows internally as well) is to put the environment (for this example) into a `my_pendulum` directory. Then the datastructures are in `my_pendulum/my_pendulum.h` and the operations are in `my_pendulum/operations_generic.h`. `operations_generic.h` means that these are pure C++, dependency-free operations that can run on any device. If you need to use external libraries (e.g. 
`std::xxx` or `nlohmann::json`) you should separate out these operations into a device specific header, e.g. `my_pendulum/operations_cpu.h`." ] }, { "cell_type": "code", "execution_count": 6, "id": "850a41e6-a1ad-4d37-ac06-298d6478bbf1", "metadata": {}, "outputs": [], "source": [ "#include \n", "#include \n", "#include \n", "#include " ] }, { "cell_type": "markdown", "id": "95dcf94b-4ac6-4e7e-829d-9a00a9325ff3", "metadata": {}, "source": [ "Finally, we can use our new environment and train it using the [Loop Interface](./07-The%20Loop%20Interface.ipynb) (same as in the previous chapter):" ] }, { "cell_type": "code", "execution_count": 7, "id": "5ef35895-a619-4fb7-937d-5ce950c42c33", "metadata": {}, "outputs": [], "source": [ "using DEVICE = rlt::devices::DEVICE_FACTORY<>;\n", "using RNG = decltype(rlt::random::default_engine(typename DEVICE::SPEC::RANDOM{}));\n", "using T = float;\n", "using TI = typename DEVICE::index_t;" ] }, { "cell_type": "code", "execution_count": 8, "id": "75070c5b-08de-4a14-8c22-5cf5226dbba4", "metadata": {}, "outputs": [], "source": [ "using PENDULUM_SPEC = MyPendulumSpecification>;\n", "using ENVIRONMENT = MyPendulum;" ] }, { "cell_type": "code", "execution_count": 9, "id": "4d7817f0-2cd2-4b21-ae37-fbc977c372c7", "metadata": {}, "outputs": [], "source": [ "struct LOOP_CORE_PARAMETERS: rlt::rl::algorithms::ppo::loop::core::DefaultParameters{\n", " static constexpr TI EPISODE_STEP_LIMIT = 200;\n", " static constexpr TI TOTAL_STEP_LIMIT = 300000;\n", " static constexpr TI STEP_LIMIT = TOTAL_STEP_LIMIT/(ON_POLICY_RUNNER_STEPS_PER_ENV * N_ENVIRONMENTS) + 1; // number of PPO steps\n", "};\n", "using LOOP_CORE_CONFIG = rlt::rl::algorithms::ppo::loop::core::Config;" ] }, { "cell_type": "code", "execution_count": 10, "id": "8443382f-f275-4262-9a70-bf9eb23b6670", "metadata": {}, "outputs": [], "source": [ "template \n", "struct LOOP_EVAL_PARAMETERS: rlt::rl::loop::steps::evaluation::Parameters{\n", " static constexpr TI EVALUATION_INTERVAL = 4;\n", 
" static constexpr TI NUM_EVALUATION_EPISODES = 10;\n", " static constexpr TI N_EVALUATIONS = NEXT::CORE_PARAMETERS::STEP_LIMIT / EVALUATION_INTERVAL;\n", "};\n", "using LOOP_CONFIG = rlt::rl::loop::steps::evaluation::Config>;\n", "using LOOP_STATE = typename LOOP_CONFIG::template State;" ] }, { "cell_type": "code", "execution_count": 11, "id": "afcc818f-34bb-44cc-ae45-1887b5225237", "metadata": {}, "outputs": [], "source": [ "DEVICE device;\n", "TI seed = 1;\n", "LOOP_STATE ls;\n", "rlt::malloc(device, ls);\n", "rlt::init(device, ls, seed);\n", "ls.actor_optimizer.parameters.alpha = 1e-3; // increasing the learning rate leads to faster training of the Pendulum-v1 environment\n", "ls.critic_optimizer.parameters.alpha = 1e-3;" ] }, { "cell_type": "code", "execution_count": 12, "id": "7719b694-13ce-4af2-87d6-becae738b0ce", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Step: 0/74 Mean return: -1300.94 Mean episode length: 200\n", "Step: 4/74 Mean return: -1230.49 Mean episode length: 200\n", "Step: 8/74 Mean return: -1117.24 Mean episode length: 200\n", "Step: 12/74 Mean return: -702.198 Mean episode length: 200\n", "Step: 16/74 Mean return: -545.253 Mean episode length: 200\n", "Step: 20/74 Mean return: -417.902 Mean episode length: 200\n", "Step: 24/74 Mean return: -212.429 Mean episode length: 200\n", "Step: 28/74 Mean return: -160.032 Mean episode length: 200\n", "Step: 32/74 Mean return: -156.78 Mean episode length: 200\n", "Step: 36/74 Mean return: -171.889 Mean episode length: 200\n", "Step: 40/74 Mean return: -146.42 Mean episode length: 200\n", "Step: 44/74 Mean return: -111.336 Mean episode length: 200\n", "Step: 48/74 Mean return: -196.376 Mean episode length: 200\n", "Step: 52/74 Mean return: -182.804 Mean episode length: 200\n", "Step: 56/74 Mean return: -160.231 Mean episode length: 200\n", "Step: 60/74 Mean return: -183.741 Mean episode length: 200\n", "Step: 64/74 Mean return: -413.776 Mean episode length: 200\n", 
"Step: 68/74 Mean return: -318.449 Mean episode length: 200\n" ] } ], "source": [ "while(!rlt::step(device, ls)){\n", "}" ] } ], "metadata": { "kernelspec": { "display_name": "C++17", "language": "C++17", "name": "xcpp17" }, "language_info": { "codemirror_mode": "text/x-c++src", "file_extension": ".cpp", "mimetype": "text/x-c++src", "name": "c++", "version": "17" } }, "nbformat": 4, "nbformat_minor": 5 }